Introduction
This project studies Netflix’s catalog of movies and TV shows in two parts: general trends, and a text analysis of show summaries.
Analysis
General Trends
In the first part, we analyze how Netflix’s catalog has changed over time. We look at the characteristics of each country’s productions, popular movie and TV show genres, changes in duration over time, and rating guidelines. Through this detailed analysis, we seek to characterize Netflix as a streaming platform.
History of Netflix’s TV shows and movies
This graph indicates that far more movies than TV shows are being added to Netflix, especially from 2017 to 2020, which suggests Netflix’s continued interest in expanding its movie content and in offering a variety of movie genres.
Netflix by Countries
We plotted an interactive world map showing each country’s production of movies and TV shows. Clicking a point on the map shows the country name and its number of movie and TV show productions.
We also produced a second map in which colors show each country’s total production. Dark red represents the highest total production, while dark blue represents the lowest. The United States has the highest production in the world.
TOP 20 in production
Below is a table of the 20 countries with the highest total production, in descending order.
## # A tibble: 20 × 4
## # Groups: country [20]
## country Movie `TV Show` total
## <chr> <dbl> <dbl> <dbl>
## 1 "United States" 2364 847 3211
## 2 "India" 927 81 1008
## 3 "United Kingdom" 382 246 628
## 4 " United States" 388 91 479
## 5 "Canada" 187 84 271
## 6 "Japan" 85 174 259
## 7 "France" 148 64 212
## 8 "South Korea" 47 164 211
## 9 " France" 155 26 181
## 10 "Spain" 129 52 181
## 11 " United Kingdom" 152 26 178
## 12 " Canada" 132 42 174
## 13 "Mexico" 86 48 134
## 14 " Germany" 104 19 123
## 15 "Australia" 61 56 117
## 16 "Egypt" 98 14 112
## 17 "Turkey" 81 30 111
## 18 "Germany" 78 25 103
## 19 "China" 59 41 100
## 20 "Nigeria" 87 9 96
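Several countries appear twice in the table above, once with a leading space (for example "United States" and " United States"). This suggests that multi-country entries were split on commas without trimming whitespace, although the report does not show the code. A minimal sketch of how trimming would merge such rows, written in Python rather than the R used to produce the output; the counts are copied from the table, and `merge_trimmed` is a hypothetical helper name.

```python
from collections import defaultdict

# (country, movies, tv_shows) rows copied from the table above;
# only the countries that appear twice are listed here.
rows = [
    ("United States", 2364, 847),
    (" United States", 388, 91),
    ("United Kingdom", 382, 246),
    (" United Kingdom", 152, 26),
    ("France", 148, 64),
    (" France", 155, 26),
    ("Canada", 187, 84),
    (" Canada", 132, 42),
    ("Germany", 78, 25),
    (" Germany", 104, 19),
]

def merge_trimmed(rows):
    """Strip surrounding whitespace from country names and sum the counts."""
    merged = defaultdict(lambda: [0, 0])
    for country, movies, tv_shows in rows:
        key = country.strip()
        merged[key][0] += movies
        merged[key][1] += tv_shows
    # Return (country, movies, tv_shows, total), largest total first.
    result = [(c, m, t, m + t) for c, (m, t) in merged.items()]
    return sorted(result, key=lambda r: r[3], reverse=True)

merged_rows = merge_trimmed(rows)
for row in merged_rows:
    print(row)
```

With trimming, Canada’s combined total would exceed France’s. The full ranking would still need to be recomputed from the raw data, since rows outside the top 20 may also merge.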
TOP 20 Production Types
The United States has the most productions. India ranks second in total productions, but that ranking depends primarily on its number of movies.
Frequent Categories
The top 3 genres are the same for both movies and TV shows. Movies are aimed more at adults, while TV shows are aimed more at teens and kids, as there is a specific genre of shows produced for children.
Duration
This plot shows the distribution of duration for movies and TV shows. The majority of movies last around 100 minutes, and the majority of TV shows have one season.
Duration changes
Below is a visualization of duration change over time for top 5 genres of each type. Dots are movies and triangles are TV shows.
For movies, there is a downward trend: today’s movies tend to be shorter. For TV shows, there is an upward trend: an increasing number of TV shows run longer. The points for the TV show genres are more scattered than those for movies.
Rating Distribution
Below is a bar graph of rating distribution of both movies and TV shows.
Most movies and TV shows listed on Netflix carry the TV Parental Guidelines rating TV-MA, meaning the content is intended for mature audiences and may not be suitable for viewers under 17. A number of movies are also rated R, meaning the content is restricted and may be inappropriate for viewers under 17.
TOP 20 vs. Rating Categories
Below is a graph showing the proportion of each rating category for the top 20 countries.
We can observe that most of India’s productions are rated TV-14, meaning they may be unsuitable for children under 14, while most of the United States’ productions are rated TV-MA, meaning they are intended for audiences aged 17 and over. The content of the United States’ productions is therefore likely to include more violence, sex, adult language, nudity, or substance use.
Directors
Below is an interactive map that shows each country’s number of directors. Clicking a point on the map shows the country name and its number of directors.
Colors indicate the number of directors in each country: dark red marks countries with more than 3000 directors, dark blue marks countries with fewer than 150 directors, and grey marks countries with no directors in this dataset.
Conclusion
In the global online entertainment industry, Netflix has been a market leader in video streaming services, with 183 million paid members in over 190 countries and a wide variety of genres in different languages. After analyzing the traits of movies and TV shows, as well as total production across countries over time, we conclude that Netflix is continuously trying to increase its productions and to generate new content in various genres, as a growth strategy to retain existing subscribers and attract new audiences.
NLP Topics
Beyond examining the distribution of TV series and movies on Netflix, this project aims to provide insight into the characteristics associated with a “high rated” show (covering both TV series and movies). More specifically, the second part of the project uses natural language processing to build word clouds from show summaries, compare the most frequent words across all shows and high rated shows, and compare the Flesch-Kincaid score with ratings, both overall and within the TV series and movie subcategories. The purpose is to provide an overview of the dynamics on Netflix and to offer suggestions that could increase a show’s probability of earning a high IMDb score.
Summaries Word Cloud
First, we define a threshold for a high IMDb rating.
## Title Genre Tags Languages
## Length:15480 Length:15480 Length:15480 Length:15480
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Series.or.Movie Hidden.Gem.Score Country.Availability Runtime
## Length:15480 Min. :0.600 Length:15480 Length:15480
## Class :character 1st Qu.:3.800 Class :character Class :character
## Mode :character Median :6.800 Mode :character Mode :character
## Mean :5.938
## 3rd Qu.:7.900
## Max. :9.800
## NA's :2101
## Director Writer Actors View.Rating
## Length:15480 Length:15480 Length:15480 Length:15480
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## IMDb.Score Rotten.Tomatoes.Score Metacritic.Score Awards.Received
## Min. :1.000 Min. : 0.00 Min. : 5.00 Min. : 1.000
## 1st Qu.:5.800 1st Qu.: 38.00 1st Qu.: 44.00 1st Qu.: 1.000
## Median :6.600 Median : 64.00 Median : 57.00 Median : 3.000
## Mean :6.496 Mean : 59.52 Mean : 56.81 Mean : 8.764
## 3rd Qu.:7.300 3rd Qu.: 83.00 3rd Qu.: 70.00 3rd Qu.: 8.000
## Max. :9.700 Max. :100.00 Max. :100.00 Max. :300.000
## NA's :2099 NA's :9098 NA's :11144 NA's :9405
## Awards.Nominated.For Boxoffice Release.Date
## Min. : 1.00 Length:15480 Length:15480
## 1st Qu.: 2.00 Class :character Class :character
## Median : 5.00 Mode :character Mode :character
## Mean : 13.98
## 3rd Qu.: 12.00
## Max. :386.00
## NA's :7819
## Netflix.Release.Date Production.House Netflix.Link IMDb.Link
## Length:15480 Length:15480 Length:15480 Length:15480
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Summary IMDb.Votes Image Poster
## Length:15480 Min. : 5.0 Length:15480 Length:15480
## Class :character 1st Qu.: 403.5 Class :character Class :character
## Mode :character Median : 2322.0 Mode :character Mode :character
## Mean : 42728.4
## 3rd Qu.: 20890.5
## Max. :2354197.0
## NA's :2101
## TMDb.Trailer Trailer.Site
## Length:15480 Length:15480
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
Based on the summary and the boxplot, we use 7.3 as the threshold for a high IMDb score, because it is the third quartile (75th percentile) of the scores.
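The threshold choice can be illustrated with a short sketch, in Python rather than the R used for the report. The scores below are hypothetical stand-ins for the IMDb.Score column, and the `>=` cutoff is an assumption, since the report does not say whether the threshold itself is included.

```python
import statistics

# Hypothetical IMDb scores; None stands in for missing values (NA).
scores = [5.8, 6.6, 7.3, 8.1, 4.2, 7.9, 6.1, None]

# Drop missing values before computing quantiles.
observed = [s for s in scores if s is not None]

# method="inclusive" interpolates linearly between data points, treating
# the minimum and maximum as the 0th and 100th percentiles.
q1, median, q3 = statistics.quantiles(observed, n=4, method="inclusive")

# Assumption: a show counts as "high rated" when its score reaches
# the third quartile.
high_rated = [s for s in observed if s >= q3]
print(round(q3, 2), high_rated)
```

On the real data, the third quartile reported in the summary output is 7.3.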
All shows
After the necessary preprocessing steps, such as cleaning and stemming, we create a word cloud from the summaries of all shows on Netflix. The most common word is “life”.
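The counting step behind a word cloud can be sketched as follows, in Python rather than the R used for the report. The summaries and the stop-word list are hypothetical, `word_counts` is an illustrative helper name, and stemming is omitted to keep the sketch within the standard library.

```python
import re
from collections import Counter

# Hypothetical summaries standing in for the Summary column.
summaries = [
    "A young woman rebuilds her life after a family tragedy.",
    "Two friends navigate love and life in a new city.",
    "A documentary on the life of a young chess champion.",
]

# Small illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "her", "his", "after", "two"}

def word_counts(texts):
    """Lowercase, keep alphabetic tokens, drop stop words, and count the rest."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

counts = word_counts(summaries)
print(counts.most_common(3))
```

A word cloud then scales each word by its count, so the most frequent word is drawn largest.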
High rating shows
When looking only at high rated shows, the word cloud is similar to the previous one, which covers all shows. “life” is still the most frequent word, and several other words overlap, such as “young”, “friend”, “love”, and “man”. The only difference between the two top 10 lists is that “woman” is in the top 10 for all shows, while “world” is in the top 10 for high rated shows. This might suggest that shows whose summaries do not mention “woman” are more likely to receive a higher score. According to IMDb’s official statistics, which break ratings down by gender, almost every show receives about 5 times as many ratings from male users, so the platform is heavily male oriented. One possible reading of the word cloud is that, because the platform has more male users, these users tend to give high scores to shows whose summaries do not contain the word “woman” and instead contain non-feminine words like “world”.
Series and Movies comparison
Next, we create a comparison cloud and a commonality cloud showing the most frequent words in series and movie summaries.
Words like “life”, “young”, “family”, “love”, and “new” are frequent in both series and movie summaries.
Movie summaries have more relationship nouns, such as “girlfriend”, “mother”, “son”, “wife”, and “father”, while series summaries tend to have verbs, such as “follow”, “navigate”, “show”, “explore”, and “host”.
Summary Words and Ratings
We then created a pyramid plot to show how word frequencies differ between the descriptions of high rated and low rated movies.
## 1137 1137
Because there are more low rated shows than high rated shows, raw word counts are naturally lower for high rated shows. So when a word appears more often in high rated shows than in low rated shows, the difference in proportion is substantial. Words such as “footage”, “python”, and “chronicle” are therefore more likely to appear in the summaries of high rated shows.
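The point about unequal group sizes can be made concrete by comparing each word’s share of its group rather than its raw count. A minimal sketch, in Python rather than the R used for the report; the counts are hypothetical and `relative_frequencies` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical raw word counts for each group of summaries.
high_counts = Counter({"life": 20, "footage": 6, "chronicle": 4})
low_counts = Counter({"life": 120, "footage": 3, "chronicle": 2, "love": 75})

def relative_frequencies(counts):
    """Convert raw counts to each word's share of the group's total words."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

high_rel = relative_frequencies(high_counts)
low_rel = relative_frequencies(low_counts)

# Rank words by how much larger their share is in high rated summaries.
vocabulary = set(high_rel) | set(low_rel)
difference = {w: high_rel.get(w, 0.0) - low_rel.get(w, 0.0) for w in vocabulary}
ranked = sorted(difference, key=difference.get, reverse=True)
print(ranked)
```

In this toy example “life” has the highest raw count in both groups, yet “footage” ranks first once the counts are normalized by group size.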
Summary Readability & Ratings
## document Flesch Flesch.Kincaid meanSentenceLength meanWordSyllables
## 1 text1 51.25 11.53 21.8 1.577
The Flesch Reading Ease (FRE) score rises as text becomes easier to read: the higher the score, the easier the text, and the lower the score, the harder the text. A score of 0-30 marks very difficult text that is best understood by university graduates. The Flesch-Kincaid grade level runs in the opposite direction: a higher grade means harder text.
Treating all summary text as a single document, we obtain a Flesch-Kincaid score of 11.53, a mean sentence length of 21.8 words, and a mean of 1.577 syllables per word.
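Both readability scores can be approximated from the two reported means using the standard published formulas. A minimal sketch, in Python rather than the R used for the report; the function names are illustrative, and the inputs are the rounded means from the output above.

```python
def flesch_reading_ease(mean_sentence_length, mean_word_syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * mean_sentence_length - 84.6 * mean_word_syllables

def flesch_kincaid_grade(mean_sentence_length, mean_word_syllables):
    """Flesch-Kincaid grade level: higher scores indicate harder text."""
    return 0.39 * mean_sentence_length + 11.8 * mean_word_syllables - 15.59

# Rounded means reported in the output above.
asl, asw = 21.8, 1.577

fre = flesch_reading_ease(asl, asw)
fk = flesch_kincaid_grade(asl, asw)
print(round(fre, 2), round(fk, 2))
```

The results agree with the reported 51.25 and 11.53 to within the rounding of the inputs.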
FRE vs. IMDb Score
From the graph above, there is no obvious correlation between the Flesch-Kincaid score and the IMDb score, as most Flesch-Kincaid scores are concentrated in the middle. However, once the IMDb score exceeds 6, a few Flesch-Kincaid scores stand out from the crowd as higher than average. This pattern diminishes when the IMDb score is above 9.
A useful observation from this graph is that shows rated below 3 all have Flesch-Kincaid scores below 15. In this data, a Flesch-Kincaid score above 15 therefore coincides with an IMDb score of at least 3, although it does not guarantee how high the IMDb score will be.
To draw more insight from this relationship, we take a closer look by separating movies and series. Movies and series whose summaries score 10-15 on the Flesch-Kincaid scale are more likely to receive a higher IMDb score.
FRE vs. IMDb Votes
Summaries with a Flesch-Kincaid score of 10-15 tend to have the potential to receive a high number of IMDb votes. This suggests that high quality movies and series tend to write their summaries at a score of 10-15, which corresponds roughly to an upper high school to college reading level.
FRE vs. High Ratings
For these high rated shows, the Flesch-Kincaid score varies widely. Interestingly, the Flesch-Kincaid score starts to decrease when the IMDb score is above 8.5. The lower limit for shows rated above 9 also starts to increase, which narrows the range of Flesch-Kincaid scores. In this data, shows with an IMDb score above 9 tend to have a Flesch-Kincaid score between 7 and 17.
FRE vs. Release Date
In recent years, the range of scores has shifted: from largely around 10-15 in 1980 to a range of 5-10 in 2020.
Conclusions
To conclude the second part of the project, a few characteristics associated with high rated shows on Netflix can be summarized as follows. The summary is less feminine and does not contain the word “woman”. It includes words such as “documentary”, “footage”, “chronicle”, and “python”. It has a Flesch-Kincaid score above 15 to receive an above average IMDb score and below 17 to have a high IMDb score. Finally, movies with a high Flesch-Kincaid score are more likely to receive a high IMDb score. Future producers could use this information when aiming for higher IMDb ratings for shows on Netflix.